Comparing linguistic interpretation schemes for English corpora
Authors
Abstract
Project AMALGAM explored a range of Part-of-Speech tagsets and phrase structure parsing schemes used in modern English corpus-based research. The PoS-tagging schemes and parsing schemes include some which have been used for hand annotation of corpora or manual post-editing of automatic taggers or parsers, and others which are unedited output of a parsing program. Project deliverables include: a detailed description of each PoS-tagging scheme, and a multi-tagged corpus; a “Corpus-neutral” tokenization scheme; a family of PoS-taggers, for 8 PoS-tagsets; a method for “PoS-tagset conversion”; a sample of texts parsed according to a range of parsing schemes: a MultiTreebank; an Internet service allowing researchers worldwide free access to the above resources, including a simple email-based method for PoS-tagging any English text with any or all PoS-tagset(s). We conclude that the range of tagging and parsing schemes in use is too varied to allow agreement on a standard, and that parser evaluation based on ‘bracket-matching’ is unfair to more sophisticated parsers.
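The closing remark about ‘bracket-matching’ can be illustrated with a minimal sketch. It assumes unlabelled bracket matching over token spans, in the general style of bracket precision and recall; the tree encoding (nested Python lists), the function names `brackets` and `bracket_scores`, and the example sentence are all hypothetical choices made for illustration, not the procedure used in Project AMALGAM.

```python
from collections import Counter

def brackets(tree, start=0):
    """Return (spans, end) for a tree given as nested lists.

    Leaves are token strings; every list is one constituent and
    contributes one (start, end) span over token positions.
    """
    spans = []
    pos = start
    for child in tree:
        if isinstance(child, str):
            pos += 1  # a token advances the position by one
        else:
            child_spans, pos = brackets(child, pos)
            spans.extend(child_spans)
    spans.append((start, pos))  # span of this constituent itself
    return spans, pos

def bracket_scores(candidate, gold):
    """Unlabelled bracket precision and recall of candidate against gold."""
    cand = Counter(brackets(candidate)[0])
    ref = Counter(brackets(gold)[0])
    matched = sum((cand & ref).values())  # spans present in both trees
    return matched / sum(cand.values()), matched / sum(ref.values())

# Hypothetical analyses of "the cat sat on the mat".
# A shallow scheme brackets only the two noun phrases.
flat = [["the", "cat"], "sat", "on", ["the", "mat"]]
# A more detailed scheme also brackets the verb phrase and the
# prepositional phrase.
deep = [["the", "cat"], ["sat", ["on", ["the", "mat"]]]]

precision, recall = bracket_scores(deep, flat)
print(precision, recall)
```

In this sketch the more detailed analysis recovers every bracket of the shallow reference, yet two of its five brackets have no counterpart there, so its precision drops even though nothing it proposes is shown to be wrong. That is one way a bracket-matching score can penalise a parser for producing more structure than the reference scheme records.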
Similar resources
A Cross-linguistic and Cross-cultural Study of Epistemic Modality Markers in Linguistics Research Articles
Epistemic modality devices are believed to be one of the prominent characteristics of research articles as the commonly used genre among the academic community members. Considering the importance of such devices in producing and comprehending scientific discourse, this study aimed to cross-culturally and cross-linguistically investigate epistemic modality markers as an important subcategory...
Detecting Abstract Linguistic Properties Through the Study of Corpus Data
For obvious reasons, the focus of much corpus linguistic research is on the surface word forms and strings that are available in all electronic corpora. As linguists, however, we are aware that language has structure which is not directly audible/visible on the surface. In order to study that invisible structure more effectively, we have been creating, in collaboration with others, a range of a...
Generating Conceptual Metaphors from Proposition Stores
Contemporary research on computational processing of linguistic metaphors is divided into two main branches: metaphor recognition and metaphor interpretation. We take a different line of research and present an automated method for generating conceptual metaphors from linguistic data. Given the generated conceptual metaphors, we find corresponding linguistic metaphors in corpora. In this paper,...
Application of a Corpus to Identify Gaps between English Learners and Native Speakers
In order to develop effective computer-assisted language teaching systems for learners of English as a foreign language, it is first necessary to identify gaps between learners and native speakers in the four basic linguistic skills (reading, writing, pronunciation, and listening). To identify these gaps, the accuracy and fluency in language use between learners and native speakers should be com...
Inferring language change from computer corpora: Some methodological problems
As the number and size of computer corpora grow, linguistic researchers are increasingly using them to study changes in language over time. Comparing usage at one point in time with usage at a later or an earlier period seems a stunningly simple and Saussureanly impeccable method of studying language change. Needless to say, the reality is rather different. This paper identifies some of the meth...